source("code.R")
source("cna.R")
library("kableExtra")
options(dplyr.summarise.inform = FALSE)
PDID=params$PDID
treemodel=params$TREEMODEL
MUTCOUNTBIN="./MUTCOUNTBINS.RDS"
bins=readRDS(MUTCOUNTBIN)
PD=readRDS("PD6629.RDS")
# Pathogenic substitution types
CV=c("missense","nonsense","ess_splice","frameshift","inframe","cna","loh","start_lost","nc_ess_splice","stop_lost","complex_sub")
## Read in list of genes of interest (from other pubs or dnds - here taken from prognostics).
GENES=readLines("GENES.txt")
refcolony=PD$refcolony
CENSOR_GERMLINE=TRUE
| patient | ID | age_at_sample_exact | cell_type | phase | BaitLabel | |
|---|---|---|---|---|---|---|
| 3 | PD6629 | PD6629dc | 55.05270 | PB whole | Recapture | PD6629dc |
| 4 | PD6629 | PD6629dd | 58.83094 | PB Gran | Recapture | PD6629dd |
| 1 | PD6629 | COLONY60 | 60.26831 | BFU-E-Colony | Colony | NA |
| 5 | PD6629 | PD6629de | 60.26831 | PB Gran | Recapture | PD6629de |
| 2 | PD6629 | COLONY62 | 62.28063 | BFU-E-Colony | Colony | NA |
| 6 | PD6629 | PD6629df | 63.41136 | PB Gran | Recapture | PD6629df |
| 7 | PD6629 | PD6629dg | 64.73374 | PB Gran | Recapture | PD6629dg |
| 8 | PD6629 | PD6629dh | 65.86448 | PB Gran | Recapture | PD6629dh |
tree=plot_basic_tree(PD$pdx,label = PD$patient,style="classic",cv=CV,genes=GENES)
The nodes in this plot can be cross-referenced with nodes specified in subsequent results. The plot also serves to give an idea of what the topology at the top of the tree looks like.
tree=plot_basic_tree(expand_short_branches(PD$pdx,prop = 0.1),label = PD$patient,style="classic",cv=CV,genes=GENES)
node_labels(tree)
Note that the different colours on the tree indicate the separately fitted mutation rate clades.
##Run null and alt models
NITER=2000
##The following populates a list PD$fit$<treemodel>$<altmodel|nullmodel>. The function wraptreefit wraps a call to the rtreefit package.
PD=wraptreefit(PD,niter=NITER,b.fit.null = FALSE,method=treemodel)
PD=wraptreefit(PD,niter=NITER,b.fit.null = TRUE,method=treemodel)
par(mfcol=c(1,2))
tree=plot_tree(PD$fit[[treemodel]]$nullmodel$ultratree,cex.label=0);title(sprintf("%s:Ultrametric tree with single rate fitted",PD$patient))
tree=plot_tree(get_colored_markup_tree(PD$fit[[treemodel]]$altmodel$ultratree,PD$nodes$node[PD$nodes$status>=0]),cex.label=0);title(sprintf("%s:Ultrametric tree with multiple rates fitted",PD$patient))
| node | driver | status | child_count | type | colony_count | mean_lambda_rescaled | correction | sd_rescaled | lb_rescaled | ub_rescaled | median_rescaled | p_lt_wt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| -1 | WT | 1 | -1 | local | 1 | 17.89996 | 1.014746 | 1.0455655 | 15.992666 | 20.04496 | 17.88861 | NA |
| 63 | DNMT3A | 1 | 56 | local | 21 | 18.30081 | 1.014746 | 0.3208797 | 17.689680 | 18.96030 | 18.29556 | 0.35475 |
| 68 | JAK2:DNMT3A | 1 | 35 | local | 25 | 18.05795 | 1.014746 | 0.5876457 | 16.983898 | 19.22153 | 18.04647 | 0.44325 |
| 59 | DNMT3A | 0 | 1 | local | 1 | 20.23810 | 1.014746 | 2.9295554 | 14.129344 | 26.67156 | 20.01563 | 0.21225 |
| 58 | CBL | 0 | 1 | local | 1 | 14.67284 | 1.014746 | 2.5882421 | 9.164435 | 20.11756 | 14.84231 | 0.89775 |
| 88 | 9pUPD:JAK2:DNMT3A | 0 | 2 | local | 2 | 19.00392 | 1.014746 | 1.3265712 | 16.560183 | 21.93145 | 18.90989 | 0.26450 |
| 76 | TET2:JAK2:DNMT3A | 1 | 8 | local | 8 | 17.20811 | 1.014746 | 0.9930776 | 15.312261 | 19.26226 | 17.18853 | 0.67675 |
All ages are in terms of post conception years. The vertical red lines denote when colonies were sampled and blue lines when targeted follow up samples were taken.
| patient | node | driver | child_count | lower_median | upper_median | lower_lb95 | lower_ub95 | upper_lb95 | upper_ub95 | N | group | age_at_diagnosis_pcy | max_age_at_sample | min_age_at_sample |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PD6629 | 63 | DNMT3A | 56 | 0.0087052 | 7.021722 | 0.0061889 | 0.0223216 | 5.960489 | 8.224788 | 8 | DNMT3A | 54.8063 | 66.59274 | 55.78097 |
| PD6629 | 68 | JAK2 | 35 | 14.0861319 | 29.826768 | 12.7732820 | 15.4934830 | 28.750732 | 30.938651 | 8 | JAK2 | 54.8063 | 66.59274 | 55.78097 |
| PD6629 | 88 | 9pUPD | 2 | 32.9166133 | 46.100157 | 31.8524016 | 33.9989184 | 44.719007 | 47.515371 | 8 | 9pUPD | 54.8063 | 66.59274 | 55.78097 |
| PD6629 | 76 | TET2 | 8 | 36.4302883 | 46.652268 | 35.3107168 | 37.5932440 | 45.653070 | 47.581811 | 8 | TET2 | 54.8063 | 66.59274 | 55.78097 |
| patient | node | driver | child_count | lower_median | upper_median | lower_lb95 | lower_ub95 | upper_lb95 | upper_ub95 | N | group | age_at_diagnosis_pcy | max_age_at_sample | min_age_at_sample |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PD6629 | 63 | DNMT3A | 56 | 0.0086408 | 7.262048 | 0.0061942 | 0.02191 | 5.951499 | 8.765735 | 8 | DNMT3A | 54.8063 | 66.59274 | 55.78097 |
| PD6629 | 68 | JAK2 | 35 | 14.2510600 | 29.707781 | 12.6407896 | 15.93306 | 27.818446 | 31.597035 | 8 | JAK2 | 54.8063 | 66.59274 | 55.78097 |
| PD6629 | 76 | TET2 | 8 | 36.1437225 | 46.025114 | 34.3245419 | 37.89799 | 44.231836 | 47.729015 | 8 | TET2 | 54.8063 | 66.59274 | 55.78097 |
| PD6629 | 88 | 9pUPD | 2 | 32.7786694 | 46.309379 | 30.9287726 | 34.59708 | 44.321100 | 48.149647 | 8 | 9pUPD | 54.8063 | 66.59274 | 55.78097 |
The timing methods require an rough approximation for the number of expected detectable mutations, \(L\), in the LOH/CNA region for the duration of the branch.
Firstly we estimate a local relative somatic mutation rate for mutations detectable by CaveMan in autosomal regions. The rate is measured across a panel of samples consisting of all those colonies in all 12 patients that do not exhibit copy number aberrations. The genome is divided into 100Kb bins and the number of passed somatic mutations is counted across all samples in the panel, to give a count \(c_i\) for bin \(b_i\). The probability that a given mutation occurs in bin \(i\) is estimated by \(p(mut \in b_i)=\frac{c_i}{\sum_j c_j}\) and standard error \(p_{se}=\frac{p(mut \in b_i)}{\sqrt{c_j}}\). For a given copy number region \(C\) then \(p(mut \in C)=\sum_{b_i \in C} p(mut \in b_i)\). For a branch of duration \(t\) and with global mutation rate \(\lambda\) then \(E(L)=\lambda t p(mut \in C)\) and \(Var(L)=\lambda^2 t^2 \sum_{b_i \in C} p_{se}(mut \in b_i)^2\) where here we are not modelling errors in \(t\) and \(\lambda\).
All somatic mutations that occur prior to the LOH event, occurring a fraction \(x\) along the branch, will be homozygous with detection sensitivity \(s_{\text{HOM}}\) and those after will be heterozygous with detection sensitivity \(s_{\text{HET}}\). We model the mutations as arriving at a constant rate along the branch and fit the following model for \(x\) :
\[ N_{\text{HET}} \sim \text{Poisson} \left( (1-x) L s_{\text{HET}} \right) \] \[ N_{\text{HOM}} \sim \text{Poisson} \left( x L s_{\text{HOM}} \right) \]
with priors \(x \sim \text{Uniform}(0,1)\) and \(L \sim \mathcal{N}(E(L),Var(L))\) and where \(s_{\text{HOM}}=0.5\) (assuming perfect detection of homozygous mutant variants) and \(s_{\text{HET}}\) is estimated from germline SNPs as previously discussed.
Somatic mutations that occur prior to the CNA event, occurring a fraction \(x\) along the branch, have an equal chance of exhibiting VAF=1/3 or VAF=2/3, whereas those occurring after the event will always have VAF=1/3.
\[ N_{2/3} \sim \text{Poisson} \left( \frac{x L s_{1/3} }{2} \right) \] \[ N_{1/3} \sim \text{Poisson} \left( s_{2/3} \left( \frac{x L }{2} + (1-x) L \right) \right) \]
Where the priors are as in the LOH model above. The detection sensitivities \(s_{1/3}\) and \(s_{2/3}\) are approximated by \(s_{het}\) because of the additional sequencing depth afforded by the duplication, all though it is acknowleged that this approach is likely to underestimate the \(s_{2/3}\). Unlike the LOH case, the value of \(x\) will be relatively unaffected by \(L\) because of the similarity of \(s_{1/3}\) and \(s_{2/3}\).
The Bernoulli trial probability of the observing the mutant allele on a given read given that true VAF is \(\mu\) and assuming an error rate \(\epsilon\) is as follows:
\[ p(mut)=p(mut|\text{mutant read})p(\text{mutant read})+p(mut|\text{wild type read})p(\text{wild type read}) \] \[ p(mut)=(1-\epsilon)\mu+\frac{\epsilon}{3}(1-\mu) \]
Where in the above "mutant read" etc denotes that the read originated from a mutant chromosome.
Where a variant can be classified as having one of 2 VAFs (\(\mu_1\) or \(\mu_2\), with each having same prior probability:
\[ p(VAF=\mu_1|data)=\frac{p_{\text{binom}}(k=m|n=depth,p=(1-\epsilon)\mu_1+\frac{\epsilon}{3}(1-\mu_1))}{p_{\text{binom}}((k=m|n=depth,p=(1-\epsilon)\mu_1+\frac{\epsilon}{3}(1-\mu_1))+p_{\text{binom}}(k=m|n=depth,p=(1-\epsilon)\mu_2+\frac{\epsilon}{3}(1-\mu_2))} \]
If desired this can be further simplified; setting \(\mu^{*}_{1}=(1-\epsilon)\mu_1+\frac{\epsilon}{3}(1-\mu_1))\) and \(\mu^{*}_{2}=(1-\epsilon)\mu_2+\frac{\epsilon}{3}(1-\mu_2))\) then
\[ p(VAF=\mu_1|data)=\frac{\mu^{*m}_{1} (1-\mu^{*}_{1})^{d-m}}{\mu^{*m}_{1} (1-\mu^{*}_{1})^{d-m}+\mu^{*m}_{2} (1-\mu^{*}_{2})^{d-m}} \]
We use the above to classify both duplicated (VAF=2/3) vs unduplicated (VAF=1/3) and also heterozygous(VAF=0.5) vs homozygous mutant (VAF=1). Only variants where one of these probabilities exceeds 0.95 is classified, the others remain unclassified and are not included in the model.
Note that when applied to shared branches we typically have high accuracy in classifying variants because we consider reads that are aggregated across all colonies that share the branch.
library(karyoploteR)
## Warning: package 'karyoploteR' was built under R version 4.0.3
## Warning: package 'regioneR' was built under R version 4.0.3
## Warning: package 'GenomicRanges' was built under R version 4.0.3
## Warning: package 'BiocGenerics' was built under R version 4.0.5
## Warning: package 'S4Vectors' was built under R version 4.0.3
## Warning: package 'IRanges' was built under R version 4.0.3
## Warning: package 'GenomeInfoDb' was built under R version 4.0.5
gr=GRanges(sprintf("chr%s",bins$chr), IRanges(start=bins$start,end=bins$end))
mcols(gr)=data.frame(y=bins$prob)
ymax=max(bins$prob)
ymin=0
kp <- plotKaryotype(chromosomes=sprintf("chr%s",1:22),plot.type=1)
kpAbline(kp, h=0, ymin=ymin, ymax=ymax, lty=2, col="#666666")
kpAxis(kp, ymin = ymin, ymax=ymax)
kpLines(kp, data=gr, ymin=ymin, ymax=ymax)
We apply the above model to any copy number neutral LOH events in the tree:
hethomres=get_hethom(PD,refcolony,b_do_plot=TRUE)
## Warning in FUN(X[[i]], ...): Germline info is censored so not plotting BAFs..
## Number of muts overlapping loh shared in tree = 2
cnatimings=hethomres$hethom
cnatimings2=cnatimings %>% left_join(globaltimings,by="node")
cnatimings2$mean_loh_event=cnatimings2$xmean*(cnatimings2$upper_mean-cnatimings2$lower_mean)+cnatimings2$lower_mean
cnatimings2$lower_loh_event=cnatimings2[["x2.5."]]*(cnatimings2$upper_mean-cnatimings2$lower_mean)+cnatimings2$lower_mean
cnatimings2$upper_loh_event=cnatimings2[["x97.5."]]*(cnatimings2$upper_mean-cnatimings2$lower_mean)+cnatimings2$lower_mean
cnatimings2$t_before_end=(1-cnatimings2$xmean)*(cnatimings2$upper_mean-cnatimings2$lower_mean)
cnatimings2$t_before_end_lower=(1-cnatimings2[["x97.5."]])*(cnatimings2$upper_mean-cnatimings2$lower_mean)
cnatimings2$t_before_end_upper=(1-cnatimings2[["x2.5."]])*(cnatimings2$upper_mean-cnatimings2$lower_mean)
kable(cnatimings2, "html") %>% kable_styling("striped") %>% scroll_box(width="95%")
| nhet | nhom | label | node | het.sensitivity | chr | start | end | kb | count_in_bin | count_se | pmut | pmut_se | xmean | xse_mean | xsd | x2.5. | x50. | x97.5. | xn_eff | xRhat | lmean | lse_mean | lsd | l2.5. | l50. | l97.5. | ln_eff | lRhat | patient | driver | status | driver2 | driver3 | child_count | chknode | lower_lb95 | lower_ub95 | lower_median | lower_mean | upper_lb95 | upper_ub95 | upper_median | upper_mean | lower_mutcount_adj | upper_mutcount_adj | lower_mutcount | upper_mutcount | group | N | max_age_at_sample | min_age_at_sample | age_at_diagnosis_pcy | totcolonies | col | mean_loh_event | lower_loh_event | upper_loh_event | t_before_end | t_before_end_lower | t_before_end_upper |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | LOH_9pUPD | 88 | 0.9102877 | 9 | 0 | 33393006 | 33300000 | 6654 | 81.57205 | 0.0145321 | 0.0001782 | 0.7988957 | 0.0010226 | 0.1685979 | 0.3719991 | 0.8433913 | 0.9943569 | 27184.2 | 0.9999445 | 3.421252 | 0.0002609 | 0.0417497 | 3.339451 | 3.42126 | 3.502349 | 25616.1 | 1.000053 | PD6629 | 9pUPD | 0 | 9pUPD | 9pUPD:JAK2:DNMT3A | 2 | 88 | 31.8524 | 33.99892 | 32.91661 | 32.91223 | 44.71901 | 47.51537 | 46.10016 | 46.09592 | 624.009 | 872.7397 | 613 | 836 | 9pUPD | 8 | 66.59274 | 55.78097 | 54.8063 | 61 | #4DAF4A | 43.44463 | 37.81655 | 46.02153 | 2.651298 | 0.0743964 | 8.279374 |
No drivers to add to tree!
No drivers to add to tree!
No drivers to add to tree!
No drivers to add to tree!
No drivers to add to tree!
No drivers to add to tree!
No drivers to add to tree!
Here we exclude all local CNAs and depict as color VAF plots